In mathematics, in the areas of combinatorics and computer science, a Lyndon word is a string that is strictly smaller in lexicographic order than all of its rotations. Lyndon words are named after mathematician Roger Lyndon, who introduced them in 1954, calling them standard lexicographic sequences.[1]
Contents |
Several equivalent definitions are possible.
A k-ary Lyndon word of length n is an n-character string over an alphabet of size k, and which is the minimum element in the lexicographical ordering of all its rotations. Being the singularly smallest rotation implies that a Lyndon word differs from any of its non-trivial rotations, and is therefore aperiodic.[2]
Alternately, a Lyndon word has the property that, whenever it is split into two substrings, the left substring is always lexicographically less than the right substring. That is, if w is a Lyndon word, and w = uv is any factorization into two substrings, with u and v understood to be non-empty, then u < v. This definition implies that w is a Lyndon word if and only if there exist Lyndon words u and v such that u < v and w = uv.[3] Although there may be more than one choice of u and v with this property, there is a particular choice, called the standard factorization, in which v is as long as possible.[4]
The Lyndon words over the two-symbol binary alphabet {0,1}, sorted by length and then lexicographically within each length class, form an infinite sequence that begins
Here, ε denotes the empty string. The first string that does not belong to this sequence, "00", is omitted because it is periodic (it consists of two repetitions of the substring "0"); the second omitted string, "10", is aperiodic but is not minimal in its permutation class as it can be cyclically permuted to the smaller string "01".
The numbers of binary Lyndon words of each length, starting with length zero, form the integer sequence
Lyndon words correspond to aperiodic necklace class representatives and can thus be counted with Moreau's necklace-counting function.[2][5]
Duval (1988) provides an efficient algorithm for listing the Lyndon words of length at most n with a given alphabet size s in lexicographic order. If w is one of the words in the sequence, then the next word after w can be found by the following steps:
The worst-case time to generate the successor of a word w by this procedure is O(n). However, if the words being generated are stored in an array of length n, and the construction of x from w is performed by adding symbols onto the end of w rather than by making a new copy of w, then the average time to generate the successor of w (assuming each word is equally likely) is constant. Therefore, the sequence of all Lyndon words of length at most n can be generated in time proportional to the length of the sequence.[6] A constant fraction of the words in this sequence have length exactly n, so the same procedure can be used to efficiently generate words of length exactly n or words whose length divides n, by filtering out the generated words that do not fit these criteria.
According to the Chen–Fox–Lyndon theorem, every string may be formed in a unique way by concatenating a sequence of Lyndon words, in such a way that the words in the sequence are nonincreasing lexicographically.[7] The final Lyndon word in this sequence is the lexicographically smallest suffix of the given string.[8] A factorization into a nonincreasing sequence of Lyndon words (a so-called Lyndon factorization) may be constructed in linear time.[8] Lyndon factorizations may be used as part of a bijective variant of the Burrows–Wheeler transform for data compression,[9] and in algorithms for digital geometry.[10]
Duval (1983) developed an algorithm for standard factorization that runs in linear time and constant space. It iterates over a string trying to find as long Lyndon word as possible. When it founds one, it adds it to the result list and proceeds to search in the remaining part of string. Resulting list of strings is standard factorization of a given string. More formal description of algorithm follows.
Given a string S of length N, one should proceed with the following steps:
If one concatenates together, in lexicographic order, all the Lyndon words that have length dividing a given number n, the result is a de Bruijn sequence, a circular sequence of symbols such that each possible length-n sequence appears exactly once as one of its contiguous subsequences. For example, the concatenation of the binary Lyndon words whose length divides four is
This construction, together with the efficient generation of Lyndon words, provides an efficient method for constructing a particular de Bruijn sequence in linear time and logarithmic space.[11]
Lyndon words have an application to the description of free Lie algebras, in constructing a basis for the homogeneous part of a given degree; this was Lyndon's original motivation for introducing these words.[3] Lyndon words may be understood as a special case of Hall sets.[3]
A theorem of Radford states that the algebra of polynomials of Lyndon words with rational coefficients is a shuffle algebra; that is, they form an algebra over a ring, with multiplication taken to be the shuffle operator.